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Abstract 

Gaussian factor models have proven widely useful for parsimoniously char- 
acterizing dependence in multivariate data. There is a rich literature on their 
extension to mixed categorical and continuous variables, using latent Gaussian 
variables or through generalized latent trait models acommodating measure- 
ments in the exponential family. However, when generalizing to non-Gaussian 
measured variables the latent variables typically influence both the dependence 
structure and the form of the marginal distributions, complicating interpretation 
and introducing artifacts. To address this problem we propose a novel class of 
Bayesian Gaussian copula factor models which decouple the latent factors from 
the marginal distributions. A semiparametric specification for the marginals 
based on the extended rank likelihood yields straightforward implementation 
and substantial computational gains, critical for scaling to high- dimensional ap- 
plications. We provide new theoretical and empirical justifications for using this 
likelihood in Bayesian inference. We propose new default priors for the factor 
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loadings and develop efficient parameter-expanded Gibbs sampling for posterior 
computation. The methods are evaluated through simulations and applied to 
a dataset in political science. We also provide bfa, an easy-to-use R package 
leveraging compiled code to fit the models in this paper. 1 

Keywords: Factor analysis; Latent variables; Semiparametric; Extended rank likelihood; 
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1 Introduction 

Factor analysis and its generalizations are powerful tools for analyzing and exploring 
multivariate data, routinely used in applications as diverse as social science, genomics 
and finance. The typical Gaussian factor model is given by 

y i = Ar) l + e i (1.1) 

where yi is a p x 1 vector of observed variables, A is a p x k matrix of factor loadings 
(k < p), t]i ~ N(0,I) is a x 1 vector of latent variables or factor scores, and 
e, ~ iV(0, E) are idiosyncratic noise with S = diag(of, . . . , aT). Marginalizing out 
the latent variables, yi ~ N(0, AA' + S), so that the covariance in y^ is explained 
by the (lower-dimensional) latent factors. The model in (1.1) may be generalized 
to incorporate covariates at the level of the observed or latent variables, or to allow 
dependence between the latent factors. For exposition we focus on this simple case. 

This model has been extended to data with mixed measurement scales, often by 
linking observed categorical variables to underlying Gaussian variables which follow 
a latent factor model. Muthen (1983) used this idea to formulate a linear structural 
components (LISCOMP) modeling framework. An alternative is to include shared 

1 Available from http://stat.duke.edu/~jsm38/software/bfa 
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latent factors in separate generalized linear models for each observed variable (Dunson, 
2000, 2003; Moustaki and Knott, 2000; Sammel et al., 1997). Unlike in the Gaussian 
factor model the latent variables typically impact both dependence and the form of 
the marginal distributions. For example, suppose y, t = (jfa,,?/^)' are bivariate counts 
assigned Poisson log-linear models: logE^j | r/i) = fij + Xr/i. Then A governs both the 
dependence between yn, y i2 and the overdispersion in each marginal distribution. This 
confounding can lead to substantial artifacts and misleading inferences. Additionally, 
computation in such models is difficult and requiring marginal distributions in the 
exponential family can be restrictive. 

There is a growing literature on semiparametric latent factor models to address the 
latter problem. A number of authors have proposed mixtures of factor models (Chen 
et al., 2010; Ghahramani and Beal, 2000). Song et al. (2010) instead allow flexible 
error distributions in Eq. (1.1). Yang and Dunson (2010) proposed a broad class of 
semiparametric structural equation models which allow an unknown distribution for 
r)i. When building such flexible mixture models there is a sacrifice to be made in 
terms of interpretation, parsimony and computation, and subtle confounding effects 
remain. It would be appealing to retain the simplicity, interpretability and computa- 
tional scalability of Gaussian factor models while allowing the marginal distributions 
to be unknown and free of the dependence structure. 

To accomplish these ambitious goals we propose a semiparametric Bayesian Gaus- 
sian copula model utilizing the extended rank likelihood of Hoff (2007) for the marginal 
distributions. This approximation avoids a full model specification and is in some sense 
not fully Bayesian, but in practice we expect that this rank-based likelihood discards 
only a mild amount of information while providing robust inference. An additional 
contribution of this paper is to provide new theoretical and empirical justification for 
this approach. 
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We proceed as follows: In Section 2, we propose the Gaussian copula factor model 
for mixed scale data and discuss its relationship to existing models. In Section 3 we 
develop a Bayesian approach to inference, specifying prior distributions and outlining 
a straightforward and efficient Gibbs sampler for posterior computation. Section 4 
contains a simulation study, and Section 5 illustrates the utility of our method in a 
political science application. Section 6 concludes with a discussion. 

2 The Gaussian copula factor model 

A p-dimensional copula C is a distribution function on [0, 1} P where each univariate 
marginal distribution is uniform on [0, 1]. Any joint distribution F can be completely 
specified by its marginal distributions and a copula; that is, there exists a copula C 
such that 



where Fj are the univariate marginal distributions of F (Sklar, 1959). If all Fj are 
continuous then C is unique, otherwise it is uniquely determined on Ran(Fi) x • • • x 
Ran(Fp) where Ran(Fj) is the range of Fj. The copula of a multivariate distribution 
encodes its dependence structure, and is invariant to strictly increasing transformations 
of its univariate margins. Here we consider the Gaussian copula: 



C( Ul ,...u p ) = $ p ($- 1 ( Ml ),...,$- 1 (u p ) | C), u p ) G [0,1} P (2-2) 



where $ P (-|C) is the p-dimensional Gaussian cdf with correlation matrix C and $ is 
the univariate standard normal cdf. Combining (2.1) and (2.2) we have 



F(y 1 ,...,y p ) = C(F l (y 1 ),...,F p (y p )) 



(2.1) 



F( Vl , ...,%,) = ^~\F 1 (y 1 )), . . . , <$>-\F p {y p )) \ C) 



(2.3) 
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From (2.3) a number of properties are clear: On marginalizing out a subset of y the 
joint distribution of the remaining variables has a Gaussian copula with correlation 
matrix given by the appropriate submatrix of C, and yj _LL yy if and only if Cjj> = 
0. When Fj, Fy are continuous, c,y = Corr($~ 1 (F J (|/ J ), which is an 

upper bound on Corr(yj, yji) (attained when the margins are Gaussian) (Klaassen and 
Wellner, 1997). The rank correlation coefficients Kendall's tau and Spearman's rho 
are monotonic functions of Cjj> (Hult and Lindskog, 2002). For variables taking finitely 
many values Cjf gives the polychoric correlation coefficient (Muthen, 1983). 

If the margins are all continuous then zeros in R = C -1 imply conditional indepen- 
dence, as in the multivariate Gaussian distribution. However this is generally not the 
case when some variables are discrete. Indeed, in the simple case where p = 3, Y 3 is 
discrete and C13C23 7^ 0, if r 12 = then Yj and Y 2 are in fact dependent conditional on 
I3 (a similar result holds when conditioning on several continuous variables and a sin- 
gle discrete variable as well - see Appendix A). Results like these suggest that sparsity 
priors for R in Gaussian copula models (e.g. Dobra and Lenkoski (2011); Pitt et al. 
(2006)) are perhaps not always well- motivated when discrete variables are present, and 
should be interpreted only with great care. 

From (2.3) we see that a Gaussian copula model can be expressed in terms of latent 
Gaussian variables z which define y through a transformation: Let Fj~ (t) = inf{t : 
Fj(y) > 6 R} be the usual pseudo- inverse of Fj and suppose O is a covariance 
matrix with C as its correlation matrix. If z ~ N(0, fi) and jjj = F~ 1 ((^(zj/^yujj)) for 
1 < j < P then F(y) has a Gaussian copula with correlation matrix C and univariate 
marginals Fj. We utilize this representation to generalize the Gaussian factor model 
to Gaussian copula factor models by assigning z a factor model: 

77,-^(0,1), Zi\rji ~ N(Ar)i, I) 
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Inference takes place on the scaled loadings 

\ jh = \ (2.5) 

so that Cjf = Ylh=i ^jh^j'h- Rescaling is important as the factor loadings are not 
otherwise comparable across the different variables, even though (2.4) is technically 
identified. Hence we will be concerned in the sequel with priors on Xjh induced by 
various priors on Xj h . Another quantity of interest is the uniqueness of variable j, 
given by 

«,- = l-E%= i.Jk A2 (2-6) 
h=i 1 2^h=i A jh 

In the Gaussian factor model the analogous quantity is cj/(er| + Ylh=i tfh) an d rep- 
resents the proportion of variability in the manifest variable which is unexplained by 
common factors. In the Gaussian copula factor model this precise interpretation doesn't 
hold, but Uj still represents a measure of dependence on common factors. In a Bayesian 
specification of the Gaussian factor model a prior is induced on Uj through the priors 
on S and A. In our model (and other factor models where the scale is constrained) 
the prior on Uj is induced solely by the prior on A. In Section 3 we show that this has 
meaningful implications for prior specification. 

2.1 Relationship to existing factor models 

The Gaussian copula factor model subsumes many existing models, including the Gaus- 
sian factor model (where Fj are all univariate Gaussian) and probit factor models. 
Probit factor models for binary or ordered categorical data parameterize each mar- 



gin by a collection of "cutpoints" 7^0, . . . 7j C . (with jj — 00 and 7 JCj . = 00) so that 
Fj(c) = $ (jjcO- + ^2f=i ^jh}~ X j- Then Fj has the pseudoinverse 

Plugging this into (2.4) and simplifying gives = X^ c =i c l(7jc-i < % < 7jc) where 
Zj ~ iV(0, A A' + I), the data augmented representation of an ordinal probit factor 
model. Naturally the connection extends to mixed Gaussian/probit margins as well. 

Our model generalizes these to any collection of ordered variables while preserving 
the Gaussian factor model's fundamental dependence structure, encoded in its (Gaus- 
sian) copula. This is a distinction between our model and semiparametric factor models 
which assume non-Gaussian latent variables r)i or errors e*. These instead retain the 
linear model formulation (1.1), so that marginally Cov(yi) = ACov(r)i)A' + S (with S 
diagonal and crj = Vor(ey)). But F(yi) no longer has a Gaussian copula, and since the 
joint distribution is no longer Gaussian or even elliptically symmetric the covariance 
matrix is unlikely to be an adequate measure of dependence. Further, the dependence 
and marginal distributions are typically confounded since the implied correlation ma- 
trix will most likeliy depend on the parameters of the marginal distributions. 

However, in the Gaussian copula factor model A governs the dependence separately 
from the marginal distributions, representing a factor-analytic decomposition for the 
scale-free copula parameter C rather than Cov{yi). The Gaussian copula model is 
also invariant to strictly monotone transformations of observed variables and therefore 
consistent with the common assumption that there exist monotonic functions hi, . . . , h p 
such that (hi(yi), . . . hp(y p ))' follows a Gaussian factor model. Existing semiparametric 
approaches are not. 
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2.2 Marginal Distributions 

There are a number of ways to treat the marginal distributions in a copula model. An 
obvious choice is to specify a parametric family for each margin and infer the parameters 
simultaneously with C (see e.g. Pitt et al. (2006) for a Bayesian approach). This is 
computationally expensive for even moderate p, and there is often no obvious choice 
of parametric family for every margin. Since our goal is not to learn the whole joint 
distribution but rather to characterize its dependence structure we would prefer to treat 
the marginal distributions as nuisance parameters. When the data are all continuous 
a popular semiparametric method is a two-stage approach in which an estimator Fj is 
used to compute "pseudodata" % = ^~ 1 {Fj{y i j)), which are treated as fixed to infer the 
copula parameters. A natural candidate is Fj(t) = Y^l=i \^-{Vv — the ( sca l e d) 
empirical marginal cdf. Klaassen and Wellner (1997) considered such estimators in the 
Gaussian copula and Genest et al. (1995) developed this method in the general case. 

Inference for the copula based on marginal ranks is appealing because of their 
invariance properties when the data are continuous. However, this method cannot 
handle discrete margins (Hoff, 2007). To accommodate mixed discrete and continuous 
data Hoff (2007) proposed an approximation to the full likelihood called the extended 
rank likelihood. Since the transformation y^ = F~ l (<&(zij)) is nondecreasing, when we 
observe yj = (yy, . . . , y n j) we also observe a partial ordering on Zj = (z±j, . . . z n j). To 
be precise we have that 

Zj e D(y 3 ) = {Zj e R n : yij < y Vj < z Vj ) (2.7) 

The set D(yj) is just the set of possible Zj = (z±j, . . . , z n j) which are consistent with 
the ordering of the observed data on the j th variable. Let D{Y) = {Z G W xn : Zj e 
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D(jjj) V 1 < j < p}. Then we have 

P(Y\C, F 1 ,... F p ) = P(Y, Z G D(Y)\C, Ft,... F p ) 

= P(Z G D(Y)\C) x P(Y\Z G D(Y),C,Ft,...F p ) (2.8) 

where (2.8) holds because given C the event Z G D(Y) does not depend on the 
marginal distributions. Hoff (2007) proposes dropping the second term in (2.8) and 
using P(Z G D(Y)\C) as the likelihood. Intuitively we would expect the first term to 
include most of the information about C, and simulations in Section 4 provide further 
supporting evidence. Hoff (2007) shows that when the margins are all continuous the 
marginal ranks satisfy certain relaxed notions of sufficiency for C (although these fail 
when some margins are discrete). Very recently Hoff et al. (2011) further studied the 
information loss under the rank likelihood in continuous data, with encouraging results. 
But most interesting applications involve mixed data, where existing theoretical results 
do not hold. We give a new of proof of strong posterior consistency for C under the 
extended rank likelihood with nearly any mixture of discrete and continuous margins 
(barring pathological cases which preclude identification of C). Note that posterior 
consistency will generally fail under Gaussian/probit models when any margin is mis- 
specified. A similar result for continuous data and a univariate rank likelihood was 
obtained by Gu and Ghosal (2009). We replace Y with for notational clarity. 

Theorem 1. Let U be a prior distribution on the space of all positive semidefinite 
correlation matrices C with corresponding density tt(C) with respect to a measure v . 
Suppose 7r(C) > a. e. [v] and that Ft, . . . , F p , are the true marginal cdfs. Then for 
Co a.e. [v] and any neighborhood A of Cq we have that 

hm H(C G A | ZW G D(yW)) = 1 a.8. [G^ Fl _ Fp ] (2.9) 
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where G, 



C ,Fi,...,F t 



'oo 



is the distribution of {fji}^ under C , Fx, . . . F p . 



The proof is in Appendix B. We assumed a prior 7r(C) having full support on 
C. Under factor-analytic priors fixing k < p restricts the support of tt, and posterior 
consistency will only hold if Cq has a factor analytic decomposition in k or fewer factors. 
But by setting k large (or inferring it) it is straightforward to define factor-analytic 
priors which have full-support on C (further discussion in Section 6). In practice, many 
correlation matrices which do not exactly have a fc-factor decomposition are still well- 
approximated by a fc-factor model. Finally, the argument also applies to posterior 
consistency for A if k is chosen correctly, given compatible identifying restrictions. 

3 Prior Specification and Posterior Inference 
3.1 Prior Specification 

Choosing a good prior on A is paramount, especially since we are interested in cases 
where n is not large relative to p and in simultaneously accommodating uncertainty in 
marginal distributions. First we recall that the factor model is invariant under rotation 
or scaling of the loadings and scores. We assume that either sufficient identifying 
conditions are imposed (such as restricting A to have zeros above the diagonal and 
positive entries along the diagonal (Geweke and Zhou, 1996)) or that inference is on 
C, which does not suffer from this indeterminacy. A natural choice of prior for the 
unrestricted factor loadings is Xjh ~ N(0,l/b) independently. We can derive the 
implied distribution on Uj analytically: 



Figure 1 shows this density for b = 0.5,1,4 and K = 1,3,5. Even the "less- 




(3.1) 
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Figure 1: Comparison of the density in (3.1) 



informative" priors on A are quite informative on the uniquenesses, especially as k 
increases. Even when k is small these priors are often quite informative on the individ- 
ual scaled loadings (for example, see Fig. 4 for the k = 1 case). The problem is that 
the normal prior puts insufficient mass near zero. Coupled with the normalization this 
results in a "smearing" of mass across the columns of A, deflating Uj and potentially 
inducing spurious correlations. Hence the normal prior is a reasonable default choice 
only if n is large, if the loadings matrix A has many fixed zeros, or if k is very small and 
b is chosen carefully. Otherwise this prior is too restrictive, favoring less parsimonious 
models. 

Shrinkage priors which place significant mass at or near zero alleviate this problem 
since a limited number of terms contribute meaningfully to the sum Ylh=i ^jh- Such 
parsimony in A is also desirable in its own right since the factors are easier to inter- 
pret. Shrinkage priors have been thoroughly explored in the regression context (see e.g. 
Poison and Scott (2010) and references therein). As priors on regression coefficients 
heavy-tailed distributions are desirable. While somewhat heavy tails might be appeal- 
ing here (so that ^(A^) decays slowly to zero as \Xjh\ — > 1), extremely heavy tails are 
inappropriate. If 7r(Ajfc) has very heavy tails then with significant probability a single 
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unsealed loading (say A jm ) in a row j will dominate the others so that Xjh ~ Xjh/\Xj m \ 
for 1 < h < k. The resulting joint prior on Xj = (Xji, ■ ■ ■ , Xjk) will assign undesirably 
high probability to vectors with one entry near ±1 and the rest near 0, yielding corre- 
lations which are approximately or ±1. Therefore applying these priors in this new 
setting requires some extra care. 

As a default choice we recommend the generalized double Pareto (GDP) prior of 
(Armagan and Dunson, 2011) which has the density 

n, / IX |\-( Q+1 ) 

which we will refer to as Xjh ~ GDP(a, 0). The GDP is a flexible generalization 
of the Laplace distribution with a sharper peak at zero and heavier tails. It has the 
following scale-mixture representation: \jh\ipjh ~ ^(0, i>jh), ^jhl^jh ~ Exp(^ h /2), and 
^jh ~ Ga(a, (5) which leads to conditional conjugacy and a simple Gibbs sampler. The 
GDP's tail behavior is determined by a, and /3 is a scale parameter. Armagan and 
Dunson (2011) handle the hyperparameters by either fixing them both at 1 or assigning 
them a hyperprior. For the purposes of exposition we will fix them here. Taking a = 3, 
(3 = 1 is a good default choice: The GDP(3, 1) distribution has mean and variance 1, 
and Pr(\\jh\ < 2) w 0.96. Critically, taking a = 3 leads to tails of ir(Xjh) light enough 
to induce a sensible prior on Xjk. Figure 2 shows draws from the implied prior on Uj 
and Xjh under the GDP(3, 1) prior. 

The GDP parameters can also be chosen to induce a larger spike at and encourage 
sparsity. But often it is desirable to include positive probability at zero in ir(Xjk), in 
which case a spike-and-slab formulation may be used instead (see e.g. Carvalho et al. 
(2008); West (2003)). In this case the induced prior on Uj is a mixture of the densities 
in (3.1) which is generally flatter than any of its components and has an asymptote 
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Figure 2: Priors on the scaled loadings and uniquenesses implied by N(0,1) and GDP(3,1) 
priors on the loadings 

and point mass at Uj — 1. For brevity we do not explore them here, but similar priors 
are available in the accompanying R package. 

3.2 Parameter-Expanded Gibbs Sampling 

For efficient MCMC inference we introduce a parameter-expanded (PX) version of 
our original model. The PX approach (Liu and Wu, 1999; Meng and Van Dyk, 1999) 
adds redundant (non-identified) parameters to reduce serial dependence in MCMC and 
improve convergence and mixing behavior. Naive Gibbs sampling in our model suffers 
from high autocorrelation due to strong dependence between Z and A. We modify 
(2.4) by adding scale parameters V = diag^vf, . . . ,Vp) to weaken this dependence: 



Wi ~ N (V 1/2 Ar? 4 , V) 



Vij I) ; I <I> 



w 



ij 



(3.3) 
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Since Wy/vj and are equal in distribution (3.3) is observationally equivalent to the 
original model. We assume that V is independent of the inferential parameters a priori 
so that tt(A, H, V\Y) = 7r(A, H\Y)tt(V) (where H' is the n x k matrix with entries 
rjik) and the posterior distribution of the inferential parameters is unchanged. 

We choose the conjugate PX prior 1/v? ~ Ga(n /2,no/2) (independently). The 
greatest benefits from PX are usually realized when the PX prior becomes is most 
diffuse, which would imply sending no — > and an improper PX prior. Since V is 
unidentifiable the posterior for (A, H , V) is also improper, but we can show that even 
under this improper prior the samples of (A, H) from the corresponding Gibbs sampler 
have the desired stationary distribution 7r(A, H\Y) (Appendix C). The Gibbs sampler 
is implemented as follows: 

PX parameters: Draw 1/v? ~ Ga(n/2, Sj/2) where ^ = diag(?/>|i/2, . . . ,ip? h ./2) 
and 8j = Zj (I - H'iVj 1 + HjH'-)~ x Hj)z'j 

Factor Loadings: We assume a lower triangular loadings matrix with a positive 
diagonal; the extension to other constraints is straightforward. Let kj = max(fc, j) and 
Hj be the n x kj matrix with entries rjik for 1 < k < kj and 1 < i < n. Update 
nonzero elements in row j of A as A^- ~ N(X'j/vj, (^J 1 + HjH'^ 1 ) where \j = 
(^J 1 + HjH'j)~ 1 HjZj and \jj is restricted to be positive if j < k. 

Hyperparameters: Update (l/ipjh\— ) ~ InvGauss(\£,jh/ \jh\, £,jh) an d (£,jh\— ) ~ 
Ga(a + 1, (3 + \Xjh\) where InvGauss(a,b) is the inverse-Gaussian distribution with 
mean a and scale b. 

Factor scores: Draw rji from (t^— ) ~ N ([A' A + J] _1 A'zi, [A' A + J] -1 ) 

Augmented Data: Update Z elementwise from 

( Zij \-) ~ TN ^hVki, 1, 4, ^ j (3.4) 
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where TN(m, v, a, b) denotes the univariate normal distribution with mean m and vari- 
ance v truncated to (a, b), z\j = maxj^/j : y^j < y^} and = minj^/j : y^j > y^}. 
If y^ is missing then (%-|— ) ~ ^{Yuh=i ^jhVkh !)■ Note that (3.4) doesn't require 
a matrix inversion since (z^ JL z^-/ | A, 77^, V) for j 7^ j', a unique property of our 
factor analytic representation and a significant computational benefit as p grows. A 
potential issue here is that for continuous margins the bounds in (3.4) become tight 
as n increases, and the dependence on the adjacent augmented data points inhibits 
mixing. In our experience the performance of this sampling scheme (which we term 
"PX-Gibbs" ) is more than adequate for continuous data with n well into the hundreds. 
If problems do arise they can be addressed with Metropolis-Hastings steps or by adopt- 
ing the "pseudodata" approach for continuous variables, maintaining the copula model 
interpretation while ignoring only a minimal amount of uncertainty for large n. 

The PX-Gibbs sampler has mixing behavior at least as good as Gibbs sampling 
under the original model (which fixes V = I) (Liu and Wu, 1999; Meng and Van Dyk, 
1999), and the additional computation is negligible. The PX-Gibbs sampler often 
increases the smallest effective sample size (associated with the largest loadings) by 
an order of magnitude or more in both real and synthetic data. The improved mixing 
is also vital for the multimodal posteriors sometimes induced by shrinkage priors. To 
our knowledge this is the first application of PX to factor analysis of mixed data, but 
PX has previously been applied to EM and Gibbs sampling in Gaussian factor models 
(Ghosh and Dunson, 2009; Liu et al., 1998). In that case additional scale parameters for 
rji are introduced to reduce dependence between H and A. Since MCMC in our model 
suffers primarily from dependence between Z and A our approach is more appropriate. 
Hoff (2007) and Dobra and Lenkoski (2011) also use priors on unidentified covariance 
matrices to induce a prior on correlation matrices in Gaussian copula models. But the 
motivation there is to derive simple MCMC updates and the priors on C and V are 
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not independent a priori, precluding our strategy of choosing an optimal PX prior. 
3.3 Posterior Inference 

Given MCMC samples we can address a number of inferential problems. As in tradi- 
tional applications of factor analysis the observed variables may have been chosen to 
measure some latent traits. The posterior distribution of the factor scores r\i provide 
a measure of the latent variables for each data point, describing a projection of the 
observed data into the latent factor space, and the factors themselves are character- 
ized by the variables which load highly on them. Even if the factors are not directly 
interpretable this is a very useful exploratory technique for mixed-scale data which is 
robust to outliers and handles missing data automatically, unlike common alternatives 
such as principal component analysis. 

We can also do inference on conditional or marginal dependence relationships in yi. 
Here there is no need for identifying constraints in A, simplifying model specification. 
Tests of independence like H : Cjf < e versus Hi : Cjj> > e are simple to construct 
from MCMC output. When the variables are continuous the conditional dependence 
relationships are encoded in R = C _1 which we can compute as 

R = (AA/ + U)- 1 = U- 1 - U- x k[I + A'U^Xl^A'lf- 1 (3.5) 

Eq. (3.5) requires calculating only k- dimensional inverses, rather than p-dimensional 
inverses, a significant benefit of our factor-analytic representation. 

As discussed in Section 2 the presence of discrete variables complicates inference 
on conditional dependence. Additionally, two discrete variables may be effectively 
marginally independent even if \cjj>\ > simply by virtue of their levels of discretiza- 
tion. For these reasons, and for more readily interpretable results, it is often desirable 
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to report inferences at the level of the observed variables by considering aspects of 
the posterior predictive distribution n(y*\Y). Under our semiparametric model the 
posterior predictive is somewhat ill-defined, but we can sample from an approximation 
to n(y*\Y) by drawing A via the PX-Gibbs sampler, drawing z* ~ N(0, AA' + U) and 
setting y* = F~ (&(Zj)) where Fj are estimators of each marginal distribution (such 
as the empirical cdf). This disregards some uncertainty when making predictions; Hoff 
(2007) provides an alternative based on the values of z\j, zfj from (3.4) but (in keeping 
with his observations) we find both approaches to perform similarly. 

To sample from conditional posterior predictive distributions such as 7r(y* | y?_i) — 
x, Y) we could sample from n(y*\Y) and discard draws where y* ^ Xj for any 2 < j < 
p. This approach can be wasteful computationally since even in moderate dimensions 
most samples will be discarded. Instead we might prefer to estimate this distribution 



directly. We can write Pr(y* < y | = x, Y 



as 



J J j iz{zl\zl_ xv C)dzt 

C Rp-1 \-oo J 



tt^I y\_ x) = x,C)tt(C I Y)dz* ( _ 1) dC (3.6) 



Assume that y2,---U P are discrete, or that the empirical cdfs are used for Fj (if yj 
is continuous and Fj is a smooth estimator then z* = $ _1 (F(xj)) is fixed in the 
following). Then 7r(z? ^ | yf-x) = x i^) ls (p — l)-dimensional truncated normal 
distribution N(0, C(-i)) where Ct-u is obtained by dropping the first row and column 
of C, restricted to the set B x = {z* { w, ^(Fjixj)) < z* < ^(Fjfe)) V 2 < j < p} 
(where F(x~) is the lower limit of F at x). To estimate (3.6) from MCMC output we 
need to draw from this distribution (at least) once for every sample of C. For a general 
C this is prohibitive unless p is very small, but our factor-analytic representation allows 
us to efficiently draw from 7i(z*_^ | = x,C) by sampling (p — 1) univariate 
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truncated normals: Let A(_i) be A with the first row removed and t/(-i) be U with 
the first row and column removed. Since C(_i) = A(_i)A/_ 1 -j + U^ X ) we have 

<*U) I yU) = x ' a (-d) k n ( z U)' °. a (-d a (-i) + c/ (-d) 1 ((^i) e 

^ / niTNiXjfi,^,^,^)) N(r 1 -,0,I)dr } 

where a,j = §~ 1 (Fj(xj)), bj = Q~ 1 (Fj(xj)) and 77 is an auxiliary variable. Therefore 
we can approximate (3.6) as follows: 

1. Draw A via the PX-Gibbs sampler, and draw 77 ~ N(0, 1) 

2. Draw z* ~ TN(\jTj, Uj, a,j, bj) for 2 < j < p 

3. For each distinct value of y x set F^(yi) = J^ Vt ' N(zl;m,v) dz\ where 

m = \ x k { _ x) \k^ x yS! { _ x) + U { ^ x z\_ x) 

v = i- a 1 a;_ 1) [a ( _ 1) a;_ 1) + (3.7) 

where again the matrix inverses in (3.7) can be computed efficiently as in (3.5). This 
procedure provides estimates of the conditional cdf at the observed data points. For a 
discrete response we can then directly compute conditional probabilities, odds ratios, 
and so on. When y x is continuous these can be interpolated to give a histogram 
estimate of 7r(2/i = x) with support on the range of the observed data. A number 
of modifications to this approach are possible; for example, to condition on a subset of 
we simply drop the irrelevant rows of A(_i) and only perform step 3 for the j th 
variable if we are conditioning on yj. 

This is a natural extension of factor regression models which posit a Gaussian factor 
model for (?/j,a^)', implying a linear regression model for ir(yi | Xi) (Carvalho et al., 
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2008; West, 2003). These are especially useful when p > n as a model-based form 
of reduced rank regression (automatically selecting batches of correlated predictors by 
loading them highly on the same factor), or when there is missing data in X. Here 
we have a flexible joint model which accommodates any ordered response or covariates 
while retaining the computational simplicity of factor regression models. 

4 Simulation Study 

To study finite-sample behavior of the extended rank likelihood we compare the poste- 
rior mean correlation matrix using the Gaussian copula factor model with the extended 
rank likelihood to the posterior mean correlation matrix under 1) a Gaussian factor 
model, when the factor model is true and 2) a probit factor model, when the probit 
model is true. Both are special cases of the Gaussian copula factor model so we can 
make direct comparisons. In the probit case the copula correlation matrix is computed 
exactly as in the Gaussian factor model (as Corr(zi)). In the Gaussian factor model 
the copula correlation matrix is just Corr(yi). We use the GDP(3, 1) prior for in 
each case, taking 7r(er|) oc 1/erJ for the Gaussian factor model and assuming uniform 
priors on the cutpoints in the probit model. 

The true (unsealed) factor loadings were sampled iid GDP(3, 1). For the probit 
case each margin had five levels with probabilities sampled Dirichlet(l/ 2, ... ,1/2). 
We fix k at the truth; additional simulations (not reported here) suggest that the 
relative performance is similar under misspecified k. The cutpoints in the probit 
model were updated using independence Metropolis-Hastings steps with a proposal 
derived from the empirical cdfs. Using our R package we performed 100 replicates 
for each p/k/n combination in Table 1. Each model was fit using 60,000 MCMC it- 
erations after a 10,000 burn-in, keeping every 12th sample for a final MCMC sample 
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size of 5,000. MCMC diagnostics for a random sample of the fitted models indicated 
no convergence issues. We assess the performance of each method by computing a 

range of loss functions: Average and mean absolute bias ( p ^_^ J2i<j< p ~ c ij\ an d 

r 1 1/2 

max i<:)< p \dij — Cij\ respectively), root mean squared error: 2 Yui<j< P (^ij — c «j) 2 an d 

Stein's loss: tr(CC _1 ) + logdet(CC~ 1 ) - p. Stein's loss can be derived from the KL 
divergence from N(0,C) to N(0,C). It is also the KL divergence between Gaussian 
copula densities (or joint distributions with continuous fixed margins and Gaussian 
copulas) up to a constant and is therefore natural to consider here. 

Results are in Table 1. Unsurprisingly the two methods are essentially indistin- 
guishable in the probit case. There are some computational benefits here, since in the 
copula model we avoid Metropolis-Hastings steps for the marginal distributions. In 
the continuous case our model also does well. When p is small and p < n the paramet- 
ric model tends to slightly outperform ours, as expected. But as p grows our model 
is increasingly competitive, even outperforming the true parametric model. This is 
especially true when p > n and may be attributed to the robustness of rank-based 
inference. Unlike the probit case there is some loss in computational efficiency but 
given our results it is an appealing option when we are not confident in a parametric 
model, especially if n is not large relative to p. 



5 Application: Political-Economic Risk 

Quinn (2004) considers measuring political-economic risk, a latent quantity, using 
five proxy variables in a Gaussian/probit factor model. The author defines political- 
economic risk as the risk of a state "manipulat [ing] economic rules to the advantage 
of itself and its constituents" following (North and Weingast, 1989, pp. 808) . The 
dataset includes five indicators recorded for 62 countries: independent judiciary, black 
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Table 1: Results of the simulation study. Recorded values are the difference between the 
loss under the regular factor model and under the copula model, so that positive values 
correspond to better performance of our model. 









Gaussian Margins 






Probit Margins 




p/n/k 


loss 


Mean 


SD 


Min 


Max 


Mean 


SD 


Min 


Max 


10/50/2 


aab 


-0.0034 


0.006 


-0.023 


0.013 


0.0011 


0.0027 


-0.0079 


0.011 




rmse 


-0.044 


0.074 


-0.28 


0.14 


0.015 


0.033 


-0.086 


0.13 




mab 


-0.01 


0.026 


-0.1 


0.037 


0.0042 


0.017 


-0.038 


0.068 




stein 


-0.077 


0.12 


-0.9 


0.1 


-0.0023 


0.071 


-0.47 


0.12 


100/50/3 


aab 


0.03 


0.0041 


0.021 


0.042 


0.0011 


0.0011 


-0.00096 


0.0039 




rmse 


3.4 


0.46 


2.5 


4.8 


0.14 


0.12 


-0.12 


0.43 




mab 


0.067 


0.026 


-0.0076 


0.13 


0.006 


0.014 


-0.032 


0.048 




stein 


6.8 


2.7 


-2.8 


13 


-0.36 


0.49 


-1.7 


0.57 


100/150/3 


aab 


0.018 


0.005 


-0.0016 


0.03 


0.00043 


0.0016 


-0.0024 


0.0056 




rmse 


2.1 


0.59 


-0.33 


3.4 


0.055 


0.19 


-0.27 


0.67 




mab 


0.039 


0.025 


-0.042 


0.093 


0.0015 


0.011 


-0.029 


0.036 




stein 


2.9 


1.4 


-7 


5.4 


-0.066 


0.19 


-0.57 


0.44 


250/150/5 


aab 


0.038 


0.0032 


0.024 


0.045 


0.00048 


0.001 


-0.003 


0.0032 




rmse 


12 


0.96 


7.5 


14 


0.15 


0.3 


-0.84 


0.96 




mab 


0.12 


0.024 


0.056 


0.18 


0.0046 


0.013 


-0.025 


0.046 




stein 


58 


5.8 


32 


70 


-0.36 


0.64 


-2.3 


1.5 



market premium, lack of appropriation risk, corruption, and GDP per worker (marginal 
distributions are shown in Fig. 3). Additional background on political-economic risk 
and on the variables in this dataset is provided by Quinn (2004), and the data are 
available in the R package MCMCpack. Quinn (2004) transforms the positive contin- 
uous variables GDP and black market premium by iog(a;) and log (a; + 0.001) (resp.). 
The latter transform must account for the presence of zeros. However, their dispro- 
portionate number (14/62 observations) still leaves a large spike in the left tail and 
the normality assumption is obviously invalid. Since Quinn (2004) has already implic- 
itly assumed a Gaussian copula, our model is a natural alternative to the misspecified 
Gaussian/probit model. 

To explore sensitivity to prior distributions we fit the copula model under several 
priors: GDP(3,l), N{0,1) and the iV(0,4) priors used by Quinn (2004). We use 
100,000 MCMC iterations and save every 10 sample after a burn-in of 10,000 iterations. 
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Figure 3: Distributions of 4 variables from the political risk dataset in Quinn (2004). The 
fifth, Ind.Jud, is binary with 34/62 ones. The spike in black market premium is from the 
log(x+0.001) transform when x=0. 

Standard MCMC diagnostics gave no indication of lack of convergence. Figure 4 shows 
posterior means and credible intervals for the scaled loadings under each prior. Note 
that the N(0, 4) prior, intended to be noninformative, is actually very informative. 
It pulls the scaled loadings (and hence the correlations) toward ±1, with especially 
pronounced influence in the binary variable Ind.Jud and the other categorical variables. 
The GDP prior instead shrinks toward zero, providing conservative estimates. 

We also compare our model to the Gaussian/probit specification in Quinn (2004), 
but using the GDP(3, 1) prior. Incorrectly assuming a normal distribution for log 
black market premium has an appreciable impact on inference. The copula correlation 
between GDP and black market premium is underestimated in the Gaussian/probit 
model: mean —0.33 and 95% HPD interval (—0.46, —0.22) as opposed to —0.56 and 
(—0.73, —0.40) under our model. Figure 5 shows density estimates of draws from the 
bivariate posterior predictive of black market premium and GDP. The Gaussian/probit 
model is clearly not a good fit, assigning very little mass to the bottom-right corner 
(which contains almost 25% of the data). The Gaussian copula factor model assigns 
appropriately high density to this region. Estimates of the latent variables are impacted 
as well: Figure 6 plots the mean factor scores from each model (after shifting and 
scaling to a common range) for low-risk countries. The seven countries with the lowest 
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-1.0 -0.5 0.0 0.5 1.0 



Figure 4: (Left) Three induced priors on the scaled loadings (Right) Posterior mean/90% 
HPD intervals for scaled factor loadings under the different priors. Note the discrepancies 
are larger for discrete variables, and largest for Ind.Jud (which is binary). Evidently the 
-/V(0, 4) prior is quite informative. 

risk have identical covariate values except on GDP. Our model infers mean scores that 
are sorted by GDP (higher GDP yielding a lower score). The Gaussian/probit model 
instead assigns these countries almost identical scores. 

6 Discussion 

In this paper we have developed a new semiparametric approach to the factor anal- 
ysis of mixed data which is both robust and efficient. We propose new default prior 
distributions for factor loadings which are more suited to routine use of this model 
(and similar models, such as probit factor models). As a byproduct we also induce at- 
tractive new priors on correlation matrices in Gaussian copula models; these are both 
more flexible and parsimonious than the inverse Wishart prior used by Hoff (2007), 
and much more efficient computationally than the graphical model priors of Dobra 
and Lenkoski (2011). They admit optimal parameter expansion schemes which are 
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Posterior Predictive Density Estimates 
GCFM Quinn (2004) 




GDP.Per.Worker 

Figure 5: Posterior predictive distributions of log GDP and log black market premium, with 
observed data overlain. Note the cluster of points in the bottom-right corner; even though 
they represent over 20% of the sample the predictive density from the model in Quinn (2004) 
assigns very little mass to this area. 
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Figure 6: Comparison of the political-economic risk ranking obtained via our model and the 
mixed-data factor analysis of Quinn (2004). Points are posterior means and lines represent 
marginal 90% credible intervals. 
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easy to implement, and are readily extended to informative specifications and other 
latent variable models. We provide bfa, a freely available R package utilizing compiled 
code to implement our proposed methods (and other models used in this paper as well). 

We have not considered the issue of uncertainty in the number of factors, but it 
is straightforward to do so by adapting existing methods for Gaussian factor models. 
These include stochastic search (Carvalho et al., 2008), reversible jump MCMC (Lopes 
and West, 2004), model selection via Bayes factors (Ghosh and Dunson, 2009; Lopes 
and West, 2004) and nonparametric priors (Bhattacharya and Dunson, 2011; Paisley 
and Carin, 2009). The latter are especially promising when interest lies in C since 
they preserve the computational advantages of factor-analytic priors while providing 
full support on correlation matrices (which fails for fixed k < p). Treatment of k is 
only one part of the larger issue of model assessment in the copula framework, which is 
somewhat challenging given our semiparametric model but can be reasonably assessed 
through posterior predictive simulations. 

Several extensions of our model are possible. We can consider the larger elliptical 
copula family via scale mixing over (r)i,ei). Our theoretical results on the extended 
rank likelihood in Section 2 are readily extended to other copula families provided only 
some regularity conditions on the copula density. Finally, the simple exploratory-type 
factor model we consider here may be extended to any number of more sophisticated 
latent variable models, allowing correlated latent factors or incorporating covariates 
both at the level of the observed variables and the latent factors. 



A Conditional independence 

Assume F(Yi,Y2,Y 3 ) has a Gaussian copula with correlation matrix C, that Y3 is 
discrete, and that r 12 = 0. Let (Z u Z 2 , Z 3 ) ~ N(0,C) and B c = (F 3 (c - 1), F 3 (c)] 
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for c in the domain of Y 3 and define gj(z 3 ) = $ (Fj(yj) — Cj 3 z 3 )/(1 — c^) 1 / 2 ). It is 
straightforward to show that 

Pr(Ki < y±\Y 3 = c)Pr(Y 2 < y 2 \Y 3 = c) = E( 9l (z 3 ))E(g 2 (z 3 )) (A.l) 
Pr(Xi < Vl , Y 2 <y 2 \Y 3 = c) = E( gi (z 3 )g 2 (z 3 )) (A.2) 

where the expectations are with respect to ir(z 3 \y 3 = c) = TN(0, 1, F 3 (c — l),F 3 (c)) 
and (A.2) holds because ir(zi, z 2 \z 3 ) = ir(zi\z 3 )ir(z 2 \z 3 ) when r\ 2 = 0. Since gi,g 2 are 
monotone it is well known that E^g^z^ g 2 (z 3 )) ^ E(gi(z 3 ))E(g 2 (z 3 )) (and Yi,Y 2 are 
dependent given Y 3 ) unless one or both functions are a.s. constant, which occurs only if 
one or both of Y\, Y 2 are marginally independent of Y 3 {c\ 3 c 23 = 0). This result extends 
to conditioning on one discrete variable and any number of continuous variables since 
conditioning on a continuous variable I4 = 2/4 implies that Pr{zi = $ _1 F(?/4)) = 1, and 
7r(z 3 \y 3 , Z4) is again univariate truncated normal (with a different mean and variance). 

B Proof of Theorem 1 

Proof. We require a variant of Doob's theorem, presented in Gu and Ghosal (2009): 

Doob's Theorem. Let Xi be observations whose distributions depend on a parameter 
9, both taking values in Polish spaces. Assume 9 ~ II and Xi\9 ~ Pg. Let Xn be the 
a -field generated by X%, . . . , X^ and = cr{\J°l 1 X^ . If there exists a X^ measurable 
function f such that for (u>,9) G Q°° x O, 9 = f(u>) a.e. [P e °° x II] then the posterior 
is strongly consistent at 9 for almost every 9 [II] . 

Therefore we must establish the existence of a consistent estimator of C which is 
measurable with respect to the a-field generated by the sequence {D(Y^)}^ =1 (a 
coarsening of the cx-field generated by {^ (m) }^ = i). Let R nij = Yl=i 1 ([Vhj < Vij]) = 
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nFj(yij). Let H ni (Y^ nS) ) be the p-vector with entry j given by R ni j and let R fl (Y"' n ') = 
{R„i}™ =1 . Observe that the information contained in the extended rank likelihood 
(namely the boundary conditions in the definition of the set D(Y^)) is equivalent 
to the information contained in R n (Y^). Hence a function that is measurable with 
respect to lZ n , the a-field generated by {R rn (Y^)}'^ n=1 , is also measurable with respect 
to the cr-field generated by {-D(^K (m ' l }^ =1 and we may work exclusively with the former. 

Let U nij = ^ and U ni = {U ni i, . . . U nip )' . Then U nij Uy where C/y = Fj-(y„) 
by the SLLN, so U ni Ui and therefore U is IZoo = crl\J^ =1 IZi) measurable. Note 
that if Fj is discrete C/y is merely a relabeling of y^ (each category/integer is "labeled" 
with its marginal cumulative probability). So Ui is a sample from a Gaussian copula 
model with correlation matrix Co where the continuous margins are all U[0, 1] and the 
discrete marginal distributions are completely specified. The problem of estimating 
C from Ui reduces to estimating ordinary and polychoric/polyserial correlations with 
fixed marginals and it is straightforward to verify that the distribution of U{ is a 
regular parametric family admitting a consistent estimator of C, say h^iUi, ■ ■ ■ , Un)- 
Therefore there exists a sequence of 7^oo measurable functions hx(U\, ■ ■ ■ , Un) ~^ 
h(U\, C7 2 , ■■■) = C almost surely and 

C = h{U u U 2 , ...) = h*({R m : N > 1, 1 < i < N}) a.s. [G^ Fi _ Fp ] (B.l) 

where (B.l) holds because a null set under the measure induced by R n {Y^) is also 
null under Gq qFi ^ f . □ 



C Validity of the PX Sampler 



Let be the inferential parameters and let Sj = Zj(I — H'j{^ ■ 1 + HjH'j) 1 Hj)z'j 



Our working prior for (v-y, . . . v p ) is n^=i IG(Vj] n /2, no/2). To verify that samples of 



3 '■ 

26 



from the PX-Gibbs sampler have stationary distribution ir(@\Y) we need to show 
that as no — > the transition kernels under the marginal sampling scheme (alternately 
drawing from 7r(W|0, Y) and 7r(0| W)) and the blocked sampling scheme (alternately 
drawing from 7r(W|0, V, Y) and 7i"(0, V|W)) converge (Meng and Van Dyk, 1999). 
The t th updates under the two schemes £1X6 £IS follows: 

Scheme 1: Draw l/v 2 0] ~ Ga(n /2,n /2) and ~ Ga (^n ^+v^ y Set r = 

Vjo/vji and draw \j ~ N(rXj', + HH'y 1 ) 

Scheme 2: Draw l/v 2 tj ~ Ga w °+"(t-D^A g et r = y^-fa. an d draw Aj ~ 

N(rXj , (VJ 1 + HH'y 1 ) 

Updates for the rest of under both schemes are the same as in Section 3.2. As 
n — > under Scheme 1 the distribution of 1/vq, approaches a point mass at 1 and 
Scheme 1 converges to Scheme 2 with n = 0. 
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